Harden RAG PDF ingestion #8
Hi maintainers / @algora-pbc, a quick visibility note for the Isaac/AimenGPT RAG bounty: this PR targets a distinct ingestion-hardening gap in the scientific RAG flow rather than duplicating the citation/reranking/context-budgeting PRs already open.
Bounty reference: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
Current verification from
The main behavior change is making uploaded research PDFs with incomplete parser metadata ingest reliably instead of crashing before retrieval can happen.
Follow-up pushed in Verification after the update from
Follow-up pushed in 0bc1bd2 to preserve citation metadata in scientific RAG chunks. What changed:
Verification:
Added a short demo video artifact for Algora/reviewer convenience:
This is supplemental evidence for review; the code and tests remain the source of truth.
Updated the existing demo video with narrated voiceover explaining the ingestion-hardening changes, safer PDF handling, blank-chunk filtering, citation metadata preservation, and test coverage. The PR's existing demo-video link now points to the narrated MP4.
Part of the open Algora bounty for [ISAAC-497] Implement an enhanced RAG Pipeline for Scientific/Research Workflows.
/claim #45
Bounty reference: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu
Summary
Why this helps the scientific RAG bounty
Scientific PDFs often have incomplete or inconsistent parser metadata. The current upload path assumes metadata.pdf.info.Title, metadata.source, and metadata.loc.pageNumber are always present, so a single malformed parsed chunk can crash ingestion before the RAG pipeline can retrieve anything. This PR is a focused ingestion reliability slice that complements the existing citation/reranking/context PRs.
Demo
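For reviewers who want a concrete picture of the failure mode described under "Why this helps the scientific RAG bounty", the following is a minimal sketch of defensive metadata handling. It is not the code in this PR: the type names, the `normalizeChunk` and `normalizeChunks` helpers, the `pageContent` field, and the fallback-name parameter are all hypothetical and chosen only for illustration. Only the three metadata paths (`metadata.pdf.info.Title`, `metadata.source`, `metadata.loc.pageNumber`) and the idea of dropping blank chunks come from this PR's text.

```typescript
// Hypothetical chunk shape; only the three metadata paths below
// are taken from the PR description.
interface ParsedChunkMetadata {
  source?: string;
  pdf?: { info?: { Title?: string } };
  loc?: { pageNumber?: number };
}

interface ParsedChunk {
  pageContent?: string; // assumed field name for the chunk text
  metadata?: ParsedChunkMetadata;
}

interface SafeChunk {
  text: string;
  title: string;
  source: string;
  pageNumber: number | null;
}

// Hypothetical helper: returns null for blank chunks, otherwise a chunk
// whose metadata fields are always defined.
function normalizeChunk(chunk: ParsedChunk, fallbackName: string): SafeChunk | null {
  const text = (chunk.pageContent ?? "").trim();
  if (text.length === 0) return null; // blank chunk: nothing to embed

  const meta = chunk.metadata ?? {};
  // Optional chaining avoids a TypeError when any level is missing.
  const title = meta.pdf?.info?.Title?.trim() || fallbackName;
  const source = meta.source?.trim() || fallbackName;
  const page = meta.loc?.pageNumber;
  const pageNumber = typeof page === "number" && Number.isFinite(page) ? page : null;

  return { text, title, source, pageNumber };
}

// Hypothetical helper: normalizes a batch and drops the blank chunks.
function normalizeChunks(chunks: ParsedChunk[], fallbackName: string): SafeChunk[] {
  return chunks
    .map((chunk) => normalizeChunk(chunk, fallbackName))
    .filter((chunk): chunk is SafeChunk => chunk !== null);
}
```

In this sketch a missing title or source falls back to a caller-supplied name (for example the uploaded file name), and a missing page number becomes `null` rather than being guessed, so downstream citation code can tell "unknown page" apart from a real page number.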
Verification
From ui/:
npx vitest run __tests__/rag-ingest.test.ts
npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts
npx tsc --noEmit --pretty false
npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts
git diff --check
Results:
git diff --check passed
AI-Assisted Disclosure
This contribution was produced with AI assistance and manually reviewed/verified before submission.